Some parts of this lab are inspired by Ryan Tibshirani’s course Data Mining at Carnegie Mellon University
In most exploratory applications, the number of clusters \(K\) is unknown. Determining the number of clusters is a very difficult problem. We can not visually determine the number of clusters unless the data are low-dimensional. It’s also hard to quantify exactly what makes a cluster and why one clustering scheme is better than another.
The number of clusters is highly dependent on which clustering method you choose and how you decide to choose the correct number of clusters. There is no 100% correct answer.
(George Cassella 2011) Looking at a single patch of the sky, it is difficult to determine which galaxies are part of the same cluster and which are not. Astronomers use the relative galaxy velocity to create clusters from the observations. It would be useful to know exactly how many galaxy clusters wer are looking for in our patch of sky.
The result: over 10 years of trying to determine the number of exact clusters and many different answers.
Here we will go through the methods that Mark mentioned in class. Let’s revisit our civil war dataset:
#install.packages('cluster.datasests')
library(cluster.datasets)
library(ggfortify)## Loading required package: ggplot2
# Load Data
data(us.civil.war.battles)
?us.civil.war.battles
# Seperate the names for later
civilWar<-us.civil.war.battles[,-1]
rownames(civilWar)<-us.civil.war.battles[,1]
# Plot PCs
cw.pca<-prcomp(civilWar,scale=F,center=F)
autoplot(cw.pca)+
geom_density2d()+
ggtitle('First 2 PCs of Civil War Data')The elbow method looks at the within-cluster sum of squared distances between all pairs of cluster \(k\). Note that the within cluster sum of squared distances decreases when we increase \(k\). Instead of looking for an extrema, we look for an elbow in the plot.
library(NbClust)
library(factoextra)## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_nbclust(civilWar,kmeans,method='wss')+
ggtitle('K-means clustering for Civil War Data')Based on the elbow plot, we decide to divide the civil war data into 2 clusters, though 5 also looks like a possible split.
The silhouette measures the difference between the dissimilarity between observation \(i\) and other points in the next clostest cluster and other points in the cluster in which \(i\) belongs to. Here we look at the average silhouette statistic across clusters. It is intuitive that we want to maximize this value.
fviz_nbclust(civilWar,kmeans,method='silhouette')+
ggtitle('K-means clustering for Civil War Data - Silhouette Method') Again we see that the optimal number of clusters is 2 according to this method.
When the Gap method chooses 1 cluster as optimal, there is insufficient evidence to suggest that any clusters exist.
library(cluster)
set.seed(333)
cw.gap<-clusGap(civilWar,FUN=kmeans,K.max=10,B=500,d.power=2)
fviz_gap_stat(cw.gap,
maxSE=list(method='Tibs2001SEmax',SE.factor=1)) The optimal number of clusters determined by the Gap statistic is 2.
Since there is no best way of choosing the number of clusters, we can aggregate over all of them and see which is most commonly chosen.
cw.agg<-NbClust(scale(civilWar),method='kmeans',index='alllong')## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 16 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 7 proposed 13 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
#Data needs to be scaled for some methods
fviz_nbclust(cw.agg) + theme_minimal()## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 16 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 7 proposed 13 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
Most of the methods choose 2. Let’s re-fit our k-means so that \(k=2\).
cw.kmeans<-kmeans(civilWar,2)
fviz_cluster(cw.kmeans,data=civilWar)MclustThe Mclust function determines the optimal number of clusterst according to the Bayesian Inference Criterion (BIC):
\[ \mbox{BIC} = \ln (n) \omega -2 \ln (\hat L) \] Where \(\hat L\) is the maximized likelihood function and \(\omega\) is the number of parameters estimated in the model. Typically, we want to minimized this criterion. Th function Mclust does this for you if you specify multiple clusters (default is 1 through 9):
library(mclust)## Package 'mclust' version 5.4
## Type 'citation("mclust")' for citing this R package in publications.
cw.mclust<-Mclust(civilWar)
cw.mclust$G## [1] 2
fviz_cluster(cw.mclust,frame.type='norm',geom='point')## Warning: argument frame is deprecated; please use ellipse instead.
## Warning: argument frame.type is deprecated; please use ellipse.type
## instead.
Silhouette plot quantifies whether an observation is correctly assigned to a clsuter or not.
\(s<0\) - probably in the wrong cluster
fviz_silhouette(silhouette(cw.kmeans$cluster,dist=daisy(civilWar,metric='euclidean')),main='Silhouette plot for Kmeans Clustering')## cluster size ave.sil.width
## 1 1 34 0.71
## 2 2 12 0.52
It appears that most of the observations in cluster 2 are a good fit. Many observations in cluster 1 closely resemble those in cluster 2. None appear to be incorrectly assigned to a cluster.
Last time we assumed that there were only two clusters, yet there was visual evidence to suggest that there may be more than 2 clusters. We will use the numerical data to explore the optimal number of clusters for k-medoids clustering. Just like the galaxy example, the clustering method will dictate the optimal number clusters.
First lets load the data from last time. I gave it to you pre-formatted.
load('trumpTweet_2017.Rdata')
# Var name 'tweets'
cl.vars <- c("retweet_count","favorite_count", "source", "day", "weekday", "time", "hashes", "mentions", "link")
num.vars <- c("retweet_count", "favorite_count", "day", "weekday",
"time", "hashes", "mentions")For the parts below use only the numerical data to create clusters via PAM using a manhattan distance metric.
tweet.num<-tweets[num.vars]Using the elbow method, what is the optimal number of clusters?
Using the silhouette method, what is the optimal number of clusters?
Interpret the graph created via the gap statistic method. What is the optimal number of clusters?
We must set d.power=2 in the function clusGap() and method=‘Tibs2001SEmax’ in the function fviz_gap_stat() to use the same GAP method mark used in class.
#gapstat.pam=clusGap(scale(tweet.num),FUN=pam,K.max=10,d.power=2,B=500,metric='manhattan')
#png('gapTrump_num.png',width=500,height=300)
#fviz_gap_stat(gapstat.pam,
# maxSE=list(method='Tibs2001SEmax',SE.factor=1))+
# ggtitle('PAM Clustering Gap Statistics for Trump Tweets')
# dev.off()NbClust function in the NbClust library to aggregate results across all the clustering functions? (Hint: use index=‘all’ or ‘alllong’)# Normally we would use 'all' or 'alllong' for the index, but there is a bug in one of the methods so we cannot run all of them. Run all the models in the inds vector below.
# This takes a few minutes to run.
inds <- c("kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "ratkowsky", "ball", "ptbiserial", "mcclain", "dunn", "hubert", "sdindex", "dindex", "sdbw")
inds.k<-c()
for(ii in inds){
## FINISH THIS LINE HERE
tweets.nbc<-
#####
cat("index = ", ii, ":\n")
cat("index values:\n")
print(tweets.nbc$All.index)
cat("best cluster number:\n")
print(tweets.nbc$Best.nc)
inds.k<-c(inds.k,tweets.nbc$Best.nc[1])
}## index = kl :
## index values:
## NULL
## best cluster number:
## NULL
## index = ch :
## index values:
## NULL
## best cluster number:
## NULL
## index = hartigan :
## index values:
## NULL
## best cluster number:
## NULL
## index = ccc :
## index values:
## NULL
## best cluster number:
## NULL
## index = scott :
## index values:
## NULL
## best cluster number:
## NULL
## index = marriot :
## index values:
## NULL
## best cluster number:
## NULL
## index = trcovw :
## index values:
## NULL
## best cluster number:
## NULL
## index = tracew :
## index values:
## NULL
## best cluster number:
## NULL
## index = friedman :
## index values:
## NULL
## best cluster number:
## NULL
## index = rubin :
## index values:
## NULL
## best cluster number:
## NULL
## index = cindex :
## index values:
## NULL
## best cluster number:
## NULL
## index = db :
## index values:
## NULL
## best cluster number:
## NULL
## index = silhouette :
## index values:
## NULL
## best cluster number:
## NULL
## index = duda :
## index values:
## NULL
## best cluster number:
## NULL
## index = pseudot2 :
## index values:
## NULL
## best cluster number:
## NULL
## index = ratkowsky :
## index values:
## NULL
## best cluster number:
## NULL
## index = ball :
## index values:
## NULL
## best cluster number:
## NULL
## index = ptbiserial :
## index values:
## NULL
## best cluster number:
## NULL
## index = mcclain :
## index values:
## NULL
## best cluster number:
## NULL
## index = dunn :
## index values:
## NULL
## best cluster number:
## NULL
## index = hubert :
## index values:
## NULL
## best cluster number:
## NULL
## index = sdindex :
## index values:
## NULL
## best cluster number:
## NULL
## index = dindex :
## index values:
## NULL
## best cluster number:
## NULL
## index = sdbw :
## index values:
## NULL
## best cluster number:
## NULL
ggplot(data.frame(k=inds.k),aes(x=k))+
geom_bar()## Error in FUN(X[[i]], ...): object 'k' not found
Based on the results from above, what is the optimal number of clusters for K-mediods clustering? Run k-medoids clustering using the optimal number of clusters \(K\) and visualized your results.
Use a silhouette plot to determine if any observations were missclustered.
(Advanced aka if you have time) For the parts below use the numerical and categorical data to create clusters via Agglomerative clustering with the “ward” method.
Here is what agg. clustering gives with two clusters:
library(cluster)
library(ggplot2)
tweet.cat<-tweets[cl.vars]
tweets.dist.cat <- daisy(tweet.cat,
metric = "gower",stand=T)## Warning in daisy(tweet.cat, metric = "gower", stand = T): setting 'logical'
## variable 9 to type 'asymm'
tweets.mds <- data.frame(tweet.cat,cmdscale(tweets.dist.cat))
tweet.agnes<-agnes(tweets.dist.cat,diss=T,method='ward')
grp.agnes<-cutree(tweet.agnes,k=2)
tweet.agnes.df<-data.frame(tweets.mds,cluster=factor(grp.agnes))
ggplot(tweet.agnes.df,
mapping = aes(x = X1,
y = X2,col=cluster)) +
geom_point()Note that using categorical factors limit the methods we are able to use to determine the optimal number of clusters. Out of the Elbow, Average Silhouette, and Gap Statistics methods R is only able to apply Average Silhouette.
Using the silhouette method, what is the optimal number of clusters?
Use the NbClust function in the NbClust library to aggregate results across all the clustering functions? Using the NbClust function with only a distance matrix we are able to use a subset of the indexes: frey,mcclain,cindex,silhouette, and dunn.
Based on the results from above, what is the optimal number of clusters for K-mediods clustering? Run k-medoids clustering using the optimal number of clusters \(K\) and visualized your results.
Applied in R using the dbscan function. Mark highlighted some huge advantages using this algorithm in class.
In order to estimate \(\epsilon\) use kNNdistplot in the dbscan package.
library(dbscan)
kNNdistplot(scale(tweet.num),k=2)
abline(h=2.2,lwd=2,col='red',lty=3)
abline(h=1.2,lwd=2,col='red',lty=3)Next we implement dbscan using the epsilon determined above:
tweet.db<-dbscan(scale(tweet.num),eps=2,minPts = 8)
fviz_cluster(tweet.db,data=scale(tweet.num),geom='point')table(tweet.db$cluster)##
## 0 1 2 3
## 37 1925 287 46
Create an average silhouette plot using your DB Scan results.
Do you think this data set is a good fit for Density Based Clustering? Why or why not?
Please fill this out at the end of lab each week. This will be anonymous and will NOT be used for attendance. At the end you will have the opporutnity to leave any unanswered questions. If enough people ask the same question I will post an answer in supplementary material.